from psaw import PushshiftAPI
import datetime
import pandas as pd
import os
import re
import numpy as np
from tqdm import tqdm
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk import word_tokenize
import string
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.dates as mdates
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import seaborn as sns
import requests
from collections import Counter
import io
import networkx as nx ; import netwulf as nw
from datetime import datetime, timezone, date
import datetime as dt
from netwulf import visualize
from networkx import Graph
import random
import shifterator as sft
import networkx.algorithms.community as nx_comm
from community import community_louvain
from ast import literal_eval
tqdm.pandas()
All group members have contributed to the production of all elements of the project, including the explainer notebook and the website. Below, we outline who was the main responsible for the main parts of the project:
def setup_mpl():
#mpl.rcParams['font.family'] = 'Helvetica Neue'
mpl.rcParams['font.size'] = '15'
mpl.rcParams['lines.linewidth'] = 3
setup_mpl()
For this project, we chose to work with data from the r/politics subreddit, an online forum with 8 million members "for current and explicitly political U.S. news," according to the rules stated on the site.
Visitors to r/politics will quickly notice that the majority of submissions are users posting links to news articles published on news media sites like CNN or The Huffington Post. The headlines of these linked articles are then shown on r/politics as the titles of the submissions. Other users can then comment on the linked article, which ultimately constitutes the actual user-generated content on the site.
We focused our data extraction to only include submissions from r/politics that fulfilled the following criteria: the title mentions "Trump" or "Biden", the submission has more than 5 comments, and it was posted between October 1 and November 3, 2020.
Submission variables
The downloaded submissions are structured in a Pandas dataframe with the following columns: title, id, author, num_comments and url, indexed by the submission's creation time.
Our query to extract comments from the r/politics subreddit was likewise restricted to comments fulfilling the following criteria: the comment belongs to one of the downloaded submissions (matched via link_id) and was posted before November 10, 2020.
Comments variables
As stated, we download the associated comments section for all the downloaded submissions. Similarly to the submissions, the comments are structured in a Pandas dataframe with the following columns for each comment: id, link_id, author, parent_id and body, indexed by the comment's creation time.
Reasons for choosing this particular data set
The polling data used in this project is collected from FiveThirtyEight. The data set contains polling results from our period of interest, generated by different polling firms such as YouGov and SurveyMonkey. Each poll was conducted over a smaller time window, typically spanning a couple of days, and with a varying number of respondents. At the end of each poll, the percentage of respondents who would vote for a specific candidate was registered. We have not investigated how the individual polling firms conducted their polls, but we trust their ability to make the polls representative of the American population.
Our goal is for the end users of our website to find the results of our analyses interesting.
api = PushshiftAPI()
my_subreddit = "politics"
query = "Trump | Biden "
date1 = int(datetime.datetime(2020,10,1).timestamp())
date2 = int(datetime.datetime(2020,11,3).timestamp())
gen = api.search_submissions(num_comments= '>5',
subreddit=my_subreddit,
after=date1,
before=date2,
q=query
)
results = list(gen)
column_names = ['title', 'id', 'author', 'num_comments', 'url']
sub_data = pd.DataFrame(
{
column_names[0] : [submission.d_[column_names[0]] for submission in results],
column_names[1] : [submission.d_[column_names[1]] for submission in results],
column_names[2] : [submission.d_[column_names[2]] for submission in results],
column_names[3] : [submission.d_[column_names[3]] for submission in results],
column_names[4] : [submission.d_[column_names[4]] for submission in results]
},
index = [submission.d_['created_utc'] for submission in results])
sub_data.index = pd.to_datetime(sub_data.index, unit='s')
To ensure that each submission is unambiguously related to one of the candidates, we simply remove all submissions containing both "Trump" and "Biden".
# Drop submissions whose title mentions both "Trump" and "Biden"
both_mask = sub_data['title'].str.contains('Trump') & sub_data['title'].str.contains('Biden')
sub_data = sub_data[~both_mask]
Now we're ready to download the associated comments sections for each of the remaining submissions.
date3 = int(datetime.datetime(2020,11,10).timestamp())
comments = []
for link_id in tqdm(sub_data['id']):
gen = api.search_comments(subreddit=my_subreddit,
link_id=link_id,
before=date3)
comment_sec = list(gen)
comments += comment_sec
column_names = ['id', 'link_id', 'author', 'parent_id', 'body']
com_data = pd.DataFrame(
{
column_names[0] : [comment.d_[column_names[0]] for comment in comments],
column_names[1] : [comment.d_[column_names[1]] for comment in comments],
column_names[2] : [comment.d_[column_names[2]] for comment in comments],
column_names[3] : [comment.d_[column_names[3]] for comment in comments],
column_names[4] : [comment.d_[column_names[4]] for comment in comments]
},
columns= column_names, index = [comment.d_['created_utc'] for comment in comments])
com_data.index = pd.to_datetime(com_data.index, unit='s')
# This cell is to ensure that the "body" column of the comments data only contains strings
com_data["body"] = com_data["body"].astype("str")
Determining the mentioned politician
We need to determine whether each collected submission relates to Trump or to Biden; we have already removed all submissions including both names. We do this simply by searching the submission titles for the two names and adding a "politician" variable to each submission stating which of the politicians is mentioned.
def determine_politician_subs(title):
if (re.search('Trump', title)):
return 'Trump'
else:
return 'Biden'
tqdm.pandas()
sub_data['politician'] = sub_data['title'].progress_apply(determine_politician_subs)
Removing deleted comments
Some of the comments have been removed after being posted, so we first clean the data by filtering out the comments where author = "[deleted]". Similarly, we found that a large part of the comments were made by moderator bots reminding redditors to behave in accordance with the subreddit rules. Comments made by these bots are also removed.
com_data = com_data.drop(com_data[com_data['author'] == '[deleted]'].index)
com_data = com_data.drop(com_data[com_data['author'] == 'AutoModerator'].index)
com_data = com_data.drop(com_data[com_data['author'] == 'PoliticsModeratorBot'].index)
Removing comments by authors with less than 50 comments
We also remove all comments by authors who have posted fewer than 50 comments in total. The reason is that we want a more solid foundation on which to infer the political convictions of the redditors, which we would not have with only very few comments from each of them.
com_data = com_data.groupby('author').filter(lambda x : len(x)>=50)
Finding parent authors for the comments
To later build our network of redditors, we will populate the comments dataframe with the names of the parent authors.
This will create some [deleted] and NaN values which we will delete.
comment_authors = dict(zip(com_data.id, com_data.author))
parent = dict(zip(com_data.id, com_data.parent_id))
submission_authors = dict(zip(sub_data.id, sub_data.author))
def find_parent_author(comment_id):
parent_id = parent[comment_id]
if parent_id[:3] == 't1_':
return comment_authors.get(parent_id[3:], None)
elif parent_id[:3] == 't3_':
return submission_authors.get(parent_id[3:], None)
com_data["parent_author"] = com_data.id.apply(find_parent_author)
com_data = com_data.drop(com_data[com_data['parent_author'] == '[deleted]'].index) # Remove deleteds
com_data = com_data.drop(com_data[com_data['parent_author'].isna()].index) # Remove NaN's
Tokenization
As we will also produce textual analyses other than the VADER sentiment scores, such as word clouds, we need to do some tokenization ourselves.
For the body of each comment, we do the following steps: tokenize the text, remove URLs, numbers and other non-alphabetic tokens, lowercase all characters, and filter out stop words and punctuation.
The result of this preprocessing is a new column in the comments dataframe containing the cleaned tokens of the text body.
# Define stop words to also include punctuation
stop = set(stopwords.words('english') + list(string.punctuation))
# Function to tokenize and clean the text of each submission
def clean_tokens(text):
tokens = word_tokenize(text)
clean_tokens = [re.sub(r'http\S+', '', str(i)).lower() for i in tokens if str(i).isalpha()]
clean_tokens = [t for t in clean_tokens if t not in stop]
return clean_tokens
com_data["tokens"] = com_data.progress_apply(lambda x: clean_tokens(x["body"]), axis=1)
Determining whether the comments are "Trump"- or "Biden"-related
For each of the collected comments, we need to know whether the comment relates to Trump or Biden. Our plan is to first assign a "politician" variable to each comment matching the "politician" value of the submission to which the comment was made.
def determine_politician_coms(link_id):
#Get the id of the original submission
real_id = link_id[3:]
# Find the politician value of the corresponding submission
sub_politician = sub_data.loc[sub_data['id'] == real_id]['politician']
sub_politician = sub_politician.values[0]
return sub_politician
com_data['politician'] = com_data['link_id'].progress_apply(determine_politician_coms)
This assignment of "politician" values is very naive, as it does not account for the possibility of the comment itself mentioning either of the politicians. To overcome this, we prune the comments section tree: whenever a comment mentions a different politician than what is currently stated as its "politician" value, we remove that comment along with all comments made in reply to it - the so-called children of the comment. This way, we ensure an unambiguous picture of which politician each comment relates to.
This procedure requires us to also know the children of each comment, which we find below:
# A list of all the id's in the comments dataframe
com_ids = list(com_data['id'].values)
# We create a dictionary with keys equal to the id's of the comments and values equal to their respective "children" comments
children_dict = {com_id: [] for com_id in com_ids}
for com_id in tqdm(com_ids):
parent_id = com_data.loc[com_data['id'] == com_id]['parent_id']
parent_id = parent_id.values[0]
parent_id = parent_id[3:]
if parent_id in com_ids:
children_dict[parent_id].append(com_id)
# Append the children comments to the comments dataframe
def determine_children(com_id):
return children_dict[com_id]
com_data['children_comments'] = com_data['id'].progress_apply(determine_children)
Now we will figure out whether each comment itself mentions "Trump" or "Biden".
def comment_mentions_Trump(com_id):
mentions_Trump = False
tokenized_body = " ".join(com_data.loc[com_data['id'] == com_id]['tokens'].values[0])
if (re.search('trump', tokenized_body)):
mentions_Trump = True
return mentions_Trump
def comment_mentions_Biden(com_id):
mentions_Biden = False
tokenized_body = " ".join(com_data.loc[com_data['id'] == com_id]['tokens'].values[0])
if (re.search('biden', tokenized_body)):
mentions_Biden = True
return mentions_Biden
com_data['mentions_Trump'] = com_data['id'].progress_apply(comment_mentions_Trump)
com_data['mentions_Biden'] = com_data['id'].progress_apply(comment_mentions_Biden)
comments_to_delete = []
# Function to find the comments to be deleted due to ambiguous politician relation.
def find_coms_to_be_deleted(com_id):
if com_id not in comments_to_delete:
mentions_Biden = com_data.loc[com_data['id'] == com_id]['mentions_Biden']
mentions_Biden = mentions_Biden.values[0]
mentions_Trump = com_data.loc[com_data['id'] == com_id]['mentions_Trump']
mentions_Trump = mentions_Trump.values[0]
politician = com_data.loc[com_data['id'] == com_id]['politician']
politician = politician.values[0]
children = com_data.loc[com_data['id'] == com_id]['children_comments']
children = children.values[0]
if (politician == 'Biden' and mentions_Trump) or (politician == 'Trump' and mentions_Biden):
comments_to_delete.append(com_id)
if len(children):
for child in children:
comments_to_delete.append(child)
find_coms_to_be_deleted(child)
for com_id in tqdm(com_ids):
find_coms_to_be_deleted(com_id)
comments_to_delete= list(set(comments_to_delete))
for com_id in tqdm(comments_to_delete):
com_index = com_data.id[com_data.id == com_id].index
com_data = com_data.drop(com_index)
We'll look into the distribution of the submissions mentioning either Trump or Biden.
biden_trump_subs_dist = pd.DataFrame(sub_data.groupby("politician").count()["title"])
biden_trump_subs_dist = biden_trump_subs_dist.rename(columns = {"title" : " "})
#biden_trump_subs_dist
fig = plt.figure(dpi=140)
subs_dist_plot = biden_trump_subs_dist.plot.pie(y = " ", title = "Distribution of candidates mentioned in submissions", legend = False, colors = ["blue", "red"], shadow = False, autopct='%1.1f%%', explode=(0, 0.1), ax=plt.gca())
subs_dist_plot.figure.savefig(r"C:\Users\Lasse\Desktop\DTU\6. semester\Social informatik\Exam project local\Website new\Website-template\static\images\subs_distribution_NEW_2.svg", bbox_inches='tight', dpi=400)
print("Total number of submissions: " + str(len(sub_data)))
print("Total number of submissions mentioning Trump: " + str(len(sub_data.loc[sub_data["politician"] == "Trump"])))
print("Total number of submissions mentioning Biden: " + str(len(sub_data.loc[sub_data["politician"] == "Biden"])))
print("Number of unique authors: " + str(len(sub_data["author"].unique())))
print("Average number of submissions per author: " + str(round(sub_data.groupby("author").count()["title"].mean(),2)))
print("Average number of comments per submission: " + str(round(sub_data["num_comments"].mean(), 1)))
Total number of submissions: 9018
Total number of submissions mentioning Trump: 7205
Total number of submissions mentioning Biden: 1813
Number of unique authors: 2912
Average number of submissions per author: 3.1
Average number of comments per submission: 150.5
sub_data = sub_data[sub_data.index > "2020-10-01"]
daily_subs = sub_data['id'].resample('D').count()
MyFmt = mdates.DateFormatter('%d-%m-%Y')
#Resample number of comments per day.
#Convert to dataframe
daily_subs_df = pd.DataFrame(
{
'Daily submissions' : daily_subs.values
}, index = daily_subs.index)
MA_subs = daily_subs_df["Daily submissions"].rolling('3D').mean()
#Plot the moving average on top!
#fig, ax = plt.subplots(figsize=(15,5), dpi=400)
fig, ax = plt.subplots(figsize=(15,5), dpi=100)
ax.xaxis.set_major_locator(plt.MaxNLocator(5))
ax.plot(daily_subs.index, daily_subs.values, alpha=0.5, label="Submissions")
#ax.plot(MA_subs.index, MA_subs.values, color='k', label="3-day rolling average")
#ax.locator_params(axis='x', nbins=5)
ax.axvline(x=daily_subs.index[-1], label='Day before election day', color='orange', alpha=0.5)
#ax.plot(rolled_series.index, rolled_series.values, color='k', label="1 week rolling average")
ax.set_ylabel('Number of submissions', size = 18)
#ax.set_yscale('log')
ax.legend()
plt.grid(True, alpha=0.5)
ax.xaxis.set_major_formatter(MyFmt)
plt.title("Daily submissions on r/politics mentioning 'Trump' or 'Biden'")
plt.savefig(r'C:\Users\Lasse\Desktop\DTU\6. semester\Social informatik\Exam project local\Website new\Website-template\static\images\submissions_per_day_new.svg', bbox_inches='tight', dpi=400)
daily_coms = com_data['id'].resample('D').count()
MyFmt = mdates.DateFormatter('%d-%m-%Y')
#Resample number of comments per day.
#Convert to dataframe
daily_coms_df = pd.DataFrame(
{
'Daily comments' : daily_coms.values
}, index = daily_coms.index)
MA_coms = daily_coms_df["Daily comments"].rolling('3D').mean()
#Plot the moving average on top!
#fig, ax = plt.subplots(figsize=(15,5), dpi=400)
fig, ax = plt.subplots(figsize=(15,5), dpi=100)
ax.xaxis.set_major_locator(plt.MaxNLocator(6))
ax.plot(daily_coms.index, daily_coms.values, alpha=0.5, label="Comments")
#ax.plot(MA_subs.index, MA_subs.values, color='k', label="3-day rolling average")
#ax.locator_params(axis='x', nbins=5)
ax.axvline(x=daily_subs.index[-1], label='Day before election day', color='orange', alpha=0.5)
#ax.plot(rolled_series.index, rolled_series.values, color='k', label="1 week rolling average")
ax.set_ylabel('Number of comments', size = 18)
#ax.set_yscale('log')
ax.legend()
plt.grid(True, alpha=0.5)
ax.xaxis.set_major_formatter(MyFmt)
plt.title("Daily comments on r/politics relating to the collected submissions")
plt.savefig(r'C:\Users\Lasse\Desktop\DTU\6. semester\Social informatik\Exam project local\Website new\Website-template\static\images\comments_per_day_new.svg', bbox_inches='tight', dpi=400)
print("Total number of comments: " + str(len(com_data)))
print("Average length of comments: " + str(com_data['body'].apply(len).mean()))
print("Total number of comments related to Trump: " + str(len(com_data.loc[com_data["politician"] == "Trump"])))
print("Total number of comments related to Biden: " + str(len(com_data.loc[com_data["politician"] == "Biden"])))
print("Number of unique comment authors: " + str(len(com_data["author"].unique())))
print("Average number of comments per author: " + str(round(com_data.groupby("author").count()["id"].mean(),2)))
#print("Average number of comments per submission: " + str(round(com_data["num_comments"].mean(), 1)))
Total number of comments: 91672
Average length of comments: 138.2530325508334
Total number of comments related to Trump: 77030
Total number of comments related to Biden: 14642
Number of unique comment authors: 1553
Average number of comments per author: 59.03
url = "https://raw.githubusercontent.com/JaQtae/SocInfo2022/FinalProject/Data/president_polls_historical.csv"
download_poll = requests.get(url).content
poll_data = pd.read_csv(io.StringIO(download_poll.decode('utf-8')), parse_dates = ['end_date'], sep=',').set_index('end_date')
poll_data.index = poll_data.index.rename('dates')
poll_data = poll_data.loc[poll_data.index >= "2020-10-01"]
poll_data_reduced = poll_data[['candidate_name','pct']]
poll_data_biden = poll_data_reduced.loc[poll_data_reduced['candidate_name'] == 'Joe Biden']
poll_data_biden_daily_mean = poll_data_biden['pct'].resample('D').mean()
poll_data_trump = poll_data_reduced.loc[poll_data_reduced['candidate_name'] == 'Donald Trump']
poll_data_trump_daily_mean = poll_data_trump['pct'].resample('D').mean()
MyFmt = mdates.DateFormatter('%d-%m-%Y')
poll_data_trump_daily_mean_df = pd.DataFrame(
{
'Poll pct' : poll_data_trump_daily_mean.values
}, index = poll_data_trump_daily_mean.index)
MA_poll_data_trump = poll_data_trump_daily_mean_df["Poll pct"].rolling('3D').mean()
poll_data_biden_daily_mean_df = pd.DataFrame(
{
'Poll pct' : poll_data_biden_daily_mean.values
}, index = poll_data_biden_daily_mean.index)
MA_poll_data_biden = poll_data_biden_daily_mean_df["Poll pct"].rolling('3D').mean()
fig, ax = plt.subplots(figsize=(15,5), dpi=400)
ax.xaxis.set_major_locator(plt.MaxNLocator(7))
ax.plot(poll_data_trump_daily_mean.index, poll_data_trump_daily_mean.values, color = 'r', alpha=1, label="Average percentage of votes for Trump")
ax.plot(poll_data_biden_daily_mean.index, poll_data_biden_daily_mean.values, color = 'blue', alpha=1, label="Average percentage of votes for Biden")
ax.axvline(x=poll_data_trump_daily_mean.index[-1], label='Election day', color='green', alpha=0.5)
#ax.plot(rolled_series.index, rolled_series.values, color='k', label="1 week rolling average")
ax.set_ylabel('Avg. percentage of polling votes')
#ax.set_yscale('log')
plt.grid(True, alpha=0.5)
ax.legend()
ax.xaxis.set_major_formatter(MyFmt)
plt.title("Polling data")
plt.savefig(r'C:\Users\Lasse\Desktop\DTU\6. semester\Social informatik\Exam project local\Website new\Website-template\static\images\polling_data.svg', bbox_inches='tight', dpi=400)
We wanted to use the text in the downloaded comments dataset for two purposes: computing daily sentiment scores for each candidate to compare with the polling data, and partitioning the comment authors into Trump and Biden supporters.
We chose to work with two different dictionary-based sentiment scoring methods. To compute the daily sentiment for Biden and Trump, which we compare with the collected polling data, we used the Hedonometer (labMT) word list: a large collection of words with associated average happiness scores as judged by workers on Amazon's Mechanical Turk. To create the binary partitioning of the comment authors, we used the Valence Aware Dictionary and sEntiment Reasoner (VADER), via the vaderSentiment package, which was created specifically for text produced on social media. One of its great features is robustness with respect to data cleaning: typical preprocessing steps like tokenization, stemming and stop-word removal are not required for VADER to work well and indicate the sentiment of a body of text.
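As a toy illustration of the dictionary-based principle behind the Hedonometer approach, the sketch below uses a hypothetical three-word mini-lexicon (the real labMT list contains thousands of crowd-rated words on a 1-9 happiness scale):

```python
# Hypothetical mini-lexicon: word -> average happiness score (1-9 scale).
# The real labMT data is far larger; these three entries are made up.
mini_lexicon = {"happy": 8.3, "terrible": 1.9, "vote": 5.5}

def avg_happiness(tokens, lexicon):
    # Average the scores of the tokens covered by the lexicon;
    # words the lexicon does not contain are simply skipped.
    scored = [lexicon[t] for t in tokens if t in lexicon]
    return sum(scored) / len(scored) if scored else None

print(avg_happiness(["happy", "to", "vote"], mini_lexicon))  # (8.3 + 5.5) / 2 = 6.9
```

The same skip-unknown-words behaviour is what the hscore function below implements against the full labMT table.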
Grouping the data by the associated politician and merging it into a single corpus
author_bodies = com_data.groupby(["politician"]).apply(lambda x: x["body"].unique())
trump_corpus = " ".join(author_bodies["Trump"])
biden_corpus = " ".join(author_bodies["Biden"])
import nltk
# We reuse the clean_tokens function defined earlier on the full corpora.
TTC = clean_tokens(trump_corpus)
TBC = clean_tokens(biden_corpus)
TTC_j = " ".join(TTC)
TBC_j = " ".join(TBC)
TTC_j = TTC_j.replace(' gt', '')
TTC_j = TTC_j.replace(' amp', '')
TBC_j = TBC_j.replace(' gt', '')
TBC_j = TBC_j.replace(' amp', '')
# Implementing the VADER sentiment analysis of the comments.
# https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/
# Create the analyzer once and reuse it for every comment
sid_obj = SentimentIntensityAnalyzer()

def calculate_compound_sentiment_score(sentence):
    # polarity_scores returns a dict with pos, neg, neu and a compound score,
    # where compound is normalized to [-1, 1] (most negative to most positive)
    return sid_obj.polarity_scores(sentence)['compound']
tqdm.pandas()
com_data["compound_sentiment_score"] = com_data["body"].progress_apply(calculate_compound_sentiment_score)
test = com_data["body"].iloc[-2]
test = re.sub(r'\n', '', test)
test = re.sub(r'\'', '', test)
calculate_compound_sentiment_score('not bad good')
#test
NOTE: Here, we have deleted even more comments, leaving some authors with fewer than five comments.
tqdm.pandas()
group = com_data.groupby(["author", "politician"])
author_comments = group["body"].unique()
text = author_comments.values[0]
author_bodies = com_data.groupby(["author", "politician"]).apply(lambda x: x["body"].unique())
# Joining the comments for each politician for each of the authors
for i in tqdm(range(len(author_bodies.values))):
author_bodies.values[i] = " ".join(list(author_bodies.values[i]))
# List of unique comments authors
authors = list(com_data["author"].unique())
# Dict with the authors as keys and a dict as value: {"Biden" : compound sentiment score of the concatenated Biden-related comments, "Trump" : compound sentiment score of the concatenated Trump-related comments}
# Each sentiment score defaults to 0, corresponding to a neutral compound sentiment.
author_comment_sentiment_dict = {author : {"Biden" : 0, "Trump": 0} for author in authors}
for author in tqdm(authors):
related_politicians = list(author_bodies[author].index)
for politician in related_politicians:
author_comment_sentiment_dict[author][politician] = calculate_compound_sentiment_score(author_bodies[author][politician])
For each author, we will assign a "supporter_of" attribute, corresponding to the candidate for which the concatenated related comments have the highest compound sentiment score.
author_conviction_df = pd.DataFrame({"author" : authors})
author_conviction_df["supporter_of"] = [max(author_comment_sentiment_dict[author], key=author_comment_sentiment_dict[author].get) for author in authors]
author_conviction_df.to_csv(r"C:\Users\Lasse\Desktop\DTU\6. semester\Social informatik\Exam project local\GitHub\SocInfo2022\Data\Saved dataframes etc\author_conviction_df.csv")
Biden_mask = np.array(Image.open("Illustrations/Biden_silhouette_background.png"))
wcB = WordCloud(background_color="white",
mask=Biden_mask,
contour_width=3,
repeat=True,
min_font_size=3,
contour_color='darkgreen').generate(TBC_j)
plt.figure(figsize=(10,4))
plt.imshow(wcB, interpolation='bilinear')
plt.tight_layout(pad=0)
plt.axis('off')
plt.show()
#wcB.to_file('WordCloud_Biden.png')
Trump_mask = np.array(Image.open("Illustrations/Trump_silhouette_background.png"))
wcT = WordCloud(background_color="white",
mask=Trump_mask,
contour_width=3,
repeat=True,
min_font_size=3,
contour_color='darkgreen').generate(TTC_j)
plt.figure(figsize=(10,4))
plt.imshow(wcT, interpolation='bilinear')
plt.tight_layout(pad=0)
plt.axis('off')
plt.show()
#wcT.to_file('WordCloud_Trump.png')
For this methodology to work, the rule of thumb is to have at least 10,000 tokens per document. We group the tokens daily over the period and plot them with the desired cutoff.
com_data['date'] = com_data.index
com_data['day'] = com_data.index.strftime('%Y-%m-%d')
documents_per_day = com_data.groupby("day").tokens.sum()
# Remove HTML-escape residue ('gt', 'amp') and empty strings from each daily document
documents_per_day = documents_per_day.apply(
    lambda doc: [t for t in doc if t not in ('gt', 'amp', '')])
mpl.rcParams.update(mpl.rcParamsDefault)
fig, ax = plt.subplots(figsize=(6,3), dpi=400)
#plt.xticks(rotation='vertical')
ax.bar(documents_per_day.index, [len(doc) for doc in documents_per_day], color= "b")
ax.set_title("Daily length of documents")
ax.axhline(10000, color="r", linestyle="dashed", label="Cutoff")
ax.set_ylabel("Number of tokens (log)")
ax.xaxis.set_major_locator(plt.MaxNLocator(7))
plt.xticks(rotation = 25)
ax.set_yscale('log')
ax.legend()
plt.show()
We disregard the last 3 days in the analysis going forward, partly because they do not meet the criteria for the methods used, but also because of differences in which days each candidate has comments, i.e. on some of the final days one candidate has comments while the other does not.
We load in the Hedonometer data regarding happiness scores for words and apply it in the analysis of the daily documents in the corpus.
labMT = pd.read_csv(r"C:\Users\JaQtae\Desktop\SocInfo2022\Hedonometer.csv", index_col="Word")
setup_mpl()
def hscore(tokens, p=False):
    # Frequency-weighted average labMT happiness score of a token list.
    # Set p=True to also print the score.
    score = 0
    freq_norm = 0
    freq_dict = dict(nltk.FreqDist(tokens))
    for word, freq in freq_dict.items():
        if word in labMT.index:
            score += labMT.loc[[word]]["Happiness Score"].values[0] * freq
            freq_norm += freq
        # Words not in labMT are simply skipped
    hscore = score / (freq_norm + 1e-06)  # epsilon avoids division by zero
    if p:
        print("Happiness score: {}".format(hscore))
    return hscore
temp = com_data.groupby(['day', 'politician']).tokens.sum()
# Remove HTML-escape residue ('gt', 'amp') and empty strings from each document
temp = temp.apply(lambda doc: [t for t in doc if t not in ('gt', 'amp', '')])
P_daily_hscore = temp.progress_apply(lambda x: hscore(x))
P_daily_hscore.index = temp.index
B_hscores = []
T_hscores = []
for q in range(len(P_daily_hscore)):
if P_daily_hscore.index[q].count('Biden') == 1:
B_hscores.append(P_daily_hscore[q])
elif P_daily_hscore.index[q].count('Trump') == 1:
T_hscores.append(P_daily_hscore[q])
else:
pass
#Sanity check
len(B_hscores) == len(T_hscores)
True
df_B = pd.DataFrame()
df_T = pd.DataFrame()
df_B['B_hscore'] = B_hscores
df_T['T_hscore'] = T_hscores
catcher = []  # day labels for the Biden rows
tester = []   # day labels for the Trump rows
for k in range(len(P_daily_hscore)):
if P_daily_hscore.index[k].count('Biden') == 1:
catcher.append(P_daily_hscore.index[k][0])
elif P_daily_hscore.index[k].count('Trump') == 1:
tester.append(P_daily_hscore.index[k][0])
These values are dropped in line with the earlier remarks: the tail-end days do not meet the token-count criterion of the dictionary-based methodology.
del catcher[-2:]
del tester[-2:]
df_B.drop(df_B.tail(2).index,inplace=True)
df_T.drop(df_T.tail(2).index,inplace=True)
df_B.index = catcher
df_T.index = tester
fig, ax = plt.subplots(figsize=(8,4), dpi=400)
ax.plot(df_B.index, df_B, ls = "--", alpha = 0.5, label='Biden')
ax.plot(df_T.index, df_T, ls = "--", alpha = 0.5, label='Trump')
plt.axvline(x=df_B.index[-1], color='green', alpha = 0.4, label='Election Day')
ax.set_title("Avg. daily happiness ratings \n r/politics submission comments")
ax.set_ylabel("Avg. happiness")
ax.xaxis.set_major_locator(plt.MaxNLocator(8))
plt.xticks(rotation = 25)
plt.ylim(5.1,7.1)
plt.grid(alpha=0.5)
ax.legend()
plt.show()
We chose election day as the day to focus on, comparing it to the whole preceding part of the data set to investigate changes in sentiment over time. This amounts to a total of 33 days contained in the reference document.
election_day = P_daily_hscore.index[-5][0]
print('Election day is: {}'.format(election_day))
d_T = datetime.strptime(P_daily_hscore.index[-5][0], '%Y-%m-%d')
d_m = (d_T- dt.timedelta(days=33)).strftime('%Y-%m-%d') #string format
Election day is: 2020-11-03
l = documents_per_day.loc[election_day]
l_ref = np.concatenate(documents_per_day[(documents_per_day.index < election_day) & \
(documents_per_day.index > d_m)].values)
p_l = dict([(item[0], item[1]/len(l)) for item in Counter(l).items()])
p_lref = dict([(item[0], item[1]/len(l_ref)) for item in Counter(l_ref).items()])  # normalize by the reference document's length
sorted(p_lref.items(), key = lambda x:x[1], reverse=True)[:10]
[('trump', 7.258979206049149),
('like', 3.2827977315689982),
('people', 3.255954631379962),
('would', 2.942533081285444),
('get', 2.0862003780718337),
('one', 2.0139886578449904),
('even', 1.662381852551985),
('going', 1.6544423440453686),
('think', 1.6351606805293006),
('know', 1.5497164461247637)]
sorted(p_l.items(), key = lambda x:x[1], reverse=True)[:10]
[('trump', 0.0166351606805293),
('would', 0.010964083175803403),
('like', 0.00945179584120983),
('people', 0.008695652173913044),
('prison', 0.006427221172022685),
('right', 0.005671077504725898),
('go', 0.005671077504725898),
('vote', 0.005671077504725898),
('think', 0.004914933837429111),
('get', 0.004914933837429111)]
The main topics thus revolve around Trump and people in general; on election day itself, however, voting-related words ('vote', 'prison') understandably become more prominent.
# [(token, diff(p_l, p_lref))]
# Do it for every word in the union of sets of words for corpora
d_p = dict([(token, p_l.get(token, 0) - p_lref.get(token, 0)) \
for token in set(p_l.keys()).union(set(p_lref.keys()))])
labMT_dict = pd.Series(labMT["Happiness Score"].values, index=labMT.index).to_dict()
prep_hscore = dict([(token, labMT_dict.get(token, np.nan)-5) for token in set(p_l.keys()).union(set(p_lref.keys()))])
d_phi = dict([(token, prep_hscore[token]*d_p[token]) for token in set(p_l.keys()).union(set(p_lref.keys()))\
if not np.isnan(prep_hscore[token])]) #Do it for everything that isn't NaN
# Absolute value sorting top 10
sorted(d_phi.items(), key = lambda x: np.abs(x[1]), reverse=True)[:10]
[('like', -7.26682797731569),
('people', -3.7668204158790175),
('covid', 3.1454820415879015),
('shit', 2.4234404536862004),
('good', -2.278185255198488),
('trump', -2.1727032136105846),
('right', -2.033149338374291),
('think', -1.9562948960302462),
('get', -1.914782608695652),
('well', -1.8464120982986765)]
The $\delta\phi$ value quantifies how much each word contributes to the difference in happiness scores, and in which direction: a word used more often in one corpus than in the other contributes its (positive or negative) happiness deviation, weighted by the difference in relative frequency. This matches the word cloud representation of the texts well.
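As a toy illustration of this measure (with made-up frequencies and happiness scores, not values from our data), each word's contribution is $\delta\phi_w = (h_w - 5)\,(p_w^{(l)} - p_w^{(ref)})$:

```python
# Toy sketch of the per-word shift contribution (hypothetical numbers):
# delta_phi_w = (h_w - 5) * (p_w_day - p_w_ref), with 5 as the neutral score.
toy_scores = {"prison": 2.2, "vote": 6.5}      # made-up labMT-style happiness
p_ref = {"prison": 0.001, "vote": 0.002}       # made-up relative frequencies
p_day = {"prison": 0.006, "vote": 0.006}

d_phi_toy = {w: (toy_scores[w] - 5) * (p_day[w] - p_ref[w]) for w in toy_scores}
print(d_phi_toy)
# "prison" (a negative word used more often) pulls the score down;
# "vote" (a positive word used more often) pushes it up.
```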
senti_shift = sft.WeightedAvgShift(type2freq_1 = p_lref,
type2freq_2 = p_l,
type2score_1 = labMT_dict,
reference_value = 5)
senti_shift.get_shift_graph(detailed = True,
system_names = ['ref', 'd'],
top_n=25)
#filename='shifteratorGraph.svg')
The shift graph shows an overall negative net change in $\phi_{avg}$, driven by the top words.
url_conviction = "https://raw.githubusercontent.com/JaQtae/SocInfo2022/FinalProject/Data/Saved dataframes etc/author_conviction_df.csv"
download_conviction = requests.get(url_conviction).content
author_conviction = pd.read_csv(io.StringIO(download_conviction.decode('utf-8')), parse_dates = ['Unnamed: 0'], sep=',').set_index('Unnamed: 0')
author_conviction.index = author_conviction.index.rename('index')
edges = com_data.drop(['id', 'link_id', 'parent_id','body','tokens','politician','children_comments','mentions_Trump', 'mentions_Biden', 'compound_sentiment_score'], axis=1)
edges = edges.groupby(['author', 'parent_author']).count() # 'score' now holds the comment count per author pair
edges
In order to model the interactions of the redditors with one another, we constructed an undirected network based on their comments and replies, where the nodes are the authors. We kept only reciprocal edges, so that each edge indicates that the two authors had each replied to the other at least once, and assigned it a weight equal to the total number of comments the two authors had written to one another. Afterwards we removed all self-loops and singleton nodes from the graph. We chose this construction because we wanted to focus on discussions where both users interacted and contributed to a dialogue, in order to draw conclusions about political discussions online. Making the network undirected also simplifies it and makes it easier to plot.
# Directed graph.
edges = com_data.drop(['id', 'link_id', 'parent_id','body','tokens','politician','children_comments','mentions_Trump', 'mentions_Biden', 'compound_sentiment_score'], axis=1)
edges = edges.groupby(['author', 'parent_author']).count()
G = nx.DiGraph()
# Add all edges and remake to reciprocal.
G.add_weighted_edges_from([ (a, b, edges['score'].loc[(a,b)]) for a, b in edges.index])
G_U = G.to_undirected(reciprocal=True)
for node, ngbr, w in G_U.edges.data('weight'):
# Undirected edge weights equal to sum of weights in two corresponding directed edges
G_U.edges[node, ngbr]['weight'] = (G.edges[node, ngbr]['weight'] + G.edges[ngbr, node]['weight'])
# Finally, remove isolated nodes and self-loops in graph.
G_U.remove_edges_from(nx.selfloop_edges(G_U))
G_U.remove_nodes_from(list(nx.isolates(G_U)))
Original_G_U_edges = G_U.edges()
print('Nodes: {}'.format(G_U.number_of_nodes()))
print('Edges: {}'.format(G_U.number_of_edges()))
Nodes: 1346 Edges: 3395
author_list = list(author_conviction['author'].values)
node_list = list(G_U.nodes())
for i in tqdm(range(len(node_list))):
if node_list[i] in author_list:
G_U.nodes[node_list[i]]['conviction'] = list(author_conviction.loc[author_conviction['author'] == node_list[i]]['supporter_of'].values)[0]
else:
G_U.nodes[node_list[i]]['conviction'] = 'Unknown'
100%|████████████████████████████████████████████████████████████████████████████| 1346/1346 [00:00<00:00, 2435.84it/s]
Trump_nodes = [n for n,v in G_U.nodes(data=True) if v['conviction'] == 'Trump']
Biden_nodes = [n for n,v in G_U.nodes(data=True) if v['conviction'] == 'Biden']
print('Conviction attributes of nodes:')
print('Trump: {} // Biden: {}'.format(len(Trump_nodes),len(Biden_nodes)))
Conviction attributes of nodes: Trump: 474 // Biden: 872
One of our goals was to investigate whether authors primarily interacted with other users of similar political conviction. As a preliminary analysis, we therefore calculate the fraction of edge weights between authors with the same political conviction. We find that around 56% of the comments were written to a user sharing the author's political conviction. This could suggest that users communicate slightly more often with people who share their political beliefs, perhaps because such exchanges are easier and less polarizing. They still interact relatively broadly, however, often with someone they disagree with politically, as political disagreement might give users a larger incentive to reply to a comment. It is also important to keep in mind that our assignment of user opinions has shortcomings: it was inferred from sentiment and does not represent a ground truth, so it may differ from the users' true political opinions, which could affect the later analysis.
interactions_with_likeminded_users = []
total_interactions = []
for i in tqdm(range(len(node_list))):
own_conviction = G_U.nodes[node_list[i]]['conviction']
neighbor_list = [n for n in G_U.neighbors(node_list[i])]
for neighbor in neighbor_list:
neighbor_conviction = G_U.nodes[neighbor]['conviction']
if neighbor_conviction == own_conviction:
interactions_with_likeminded_users.append(list(G.edges()[(node_list[i], neighbor)].values())[0])
total_interactions.append(list(G.edges()[(node_list[i], neighbor)].values())[0])
else:
total_interactions.append(list(G.edges()[(node_list[i], neighbor)].values())[0])
print(sum(interactions_with_likeminded_users)/sum(total_interactions))
100%|███████████████████████████████████████████████████████████████████████████| 1346/1346 [00:00<00:00, 36443.25it/s]
0.565470417070805
We then visualised the network using the Netwulf package. Nodes with the political conviction attribute "Trump" were colored red, while nodes with the attribute "Biden" were colored blue. We also scaled the node sizes proportionally to their strength, which in this case is the sum of the weights of a node's edges. Lastly, we displayed the names of the 8 users with the largest degree as text labels.
for i in tqdm(range(len(node_list))):
conviction = G_U.nodes[node_list[i]]['conviction']
if conviction == 'Trump':
G_U.nodes[node_list[i]]['color'] = '#FF0000'
if conviction == 'Biden':
G_U.nodes[node_list[i]]['color'] = '#0000FF'
largest_degree_labels = []
largest_degree_nodes = sorted(G_U.degree, key=lambda x: x[1], reverse=True)[:8]
for i in range(len(largest_degree_nodes)):
largest_degree_labels.append(largest_degree_nodes[i][0])
#G_U = nx.read_gexf("com_network.gexf")
plt.style.use('default')
configuration = {"zoom": 1.15, "node_charge": -30, "node_gravity": 0.4,
"link_distance": 15, "link_distance_variation": 0, "node_collision": True,
"wiggle_nodes": False, "freeze_nodes": False,
"node_fill_color": '#79aaa0', "node_stroke_color": "#555555",
"node_label_color": "#000000","node_size": 10,"node_stroke_width": 0.4,
"node_size_variation": 0.8, "label_size": 10, "display_node_labels": False,
"scale_node_size_by_strength": True,"link_color": "#7c7c7c", "link_width": 2,
"link_alpha": 0.5, "link_width_variation": 0.5, "display_singleton_nodes": True,
"min_link_weight_percentile": 0,"max_link_weight_percentile": 1}
network, config = nw.visualize(G_U, plot_in_cell_below=False, config = configuration)
fig, ax = nw.draw_netwulf(network, figsize = 20)
for label in largest_degree_labels:
nw.add_node_label(ax,network,label, dy = 15, fontsize = 27)
#plt.savefig('politics_network.svg', bbox_inches='tight')
#with plt.style.context('classic'):
# network, config = nw.interactive.visualize(nw.get_filtered_network(G_U, node_group_key='conviction'), plot_in_cell_below=False)
# fig, ax = nw.draw_netwulf(network, config = configuration)
100%|██████████████████████████████████████████████████████████████████████████| 1346/1346 [00:00<00:00, 224939.56it/s]
From the graph we observe that the red and blue nodes are quite interconnected, and there do not appear to be two distinct communities for the supporters of Trump and Biden. This supports our earlier calculations and is investigated further in the later analysis. In some places we observe a number of interconnected blue nodes, but these hubs might simply reflect that there are significantly more blue 'Biden' nodes than red ones, 872 vs. 474 respectively.
As seen in the figure, most of the nodes only have a few edges, while some nodes have a large degree. This might correspond to a few popular authors that interact mutually with many of the users. Most of the users only have a single mutual social connection. This is common for a real social network and will be investigated further in the next section.
As a part of our analysis we investigated the degree distribution of the authors in our Reddit network, by plotting the probability density for the degrees in a histogram. This was also done to see if our distribution is similar to the ones encountered in other social networks.
degReddit = [x[1] for x in G_U.degree]
print('minDeg: {} | maxDeg: {}'.format(min(degReddit), max(degReddit)))
bins = np.linspace(0, 51, 22)
hist, edges = np.histogram(degReddit, bins=bins, density=True)
x = (edges[1:]+edges[:-1])/2
width = bins[1] - bins[0]
#plt.style.use('dark_background')
fig, ax = plt.subplots()
plt.title("Degree distribution for Reddit network")
ax.bar(x, hist, width=width)
ax.set_ylabel("probability density")
ax.set_xlabel("degrees")
ax.xaxis.set_major_locator(plt.MaxNLocator(10))
plt.savefig('degree_distribution_for_Reddit_network.svg', bbox_inches='tight')
minDeg: 1 | maxDeg: 50
It is observed that the probability mass is highest for small degrees and then drops off rapidly for larger degrees. This has the appearance of a power-law distribution, which is common for real social networks and supports what is mentioned in Chapter 3 of the NS book. As a reference we compare it to a similar randomly constructed Erdős–Rényi network. The random network contains the same number of nodes and links; the probability of two nodes sharing an edge is constant and is derived from formula 3.3 in Chapter 3 of the NS book.
L = len(G_U.edges)
N = len(G_U.nodes)
p = 2*L/(N*(N-1))
k = 2*L/N
print('N: {} | p: {:.3f}% | k: {:.3f}'.format(N, p * 100, k))
N: 1346 | p: 0.375% | k: 5.045
Random_graph = nx.generators.random_graphs.erdos_renyi_graph(N,p, seed = 42)
L = len(Random_graph.edges)
N = len(Random_graph.nodes)
p = 2*L/(N*(N-1))
k_random = 2*L/N
print('N: {} | p: {:.3f}% | k: {:.3f}'.format(N, p * 100, k_random))
plt.style.use('default')
degrees = [x[1] for x in Random_graph.degree]
print('minDeg: {} | maxDeg: {}'.format(min(degrees), max(degrees)))
bins = np.linspace(0, 14, 14)
hist_random, edges = np.histogram(degrees, bins=bins, density=True)
x_random = (edges[1:]+edges[:-1])/2
width = bins[1] - bins[0]
#plt.style.use('dark_background')
fig, ax = plt.subplots()
plt.title("Degree distribution for random network")
ax.bar(x_random, hist_random, width=width)
ax.set_ylabel("probability density")
ax.set_xlabel("degrees")
ax.xaxis.set_major_locator(plt.MaxNLocator(10))
plt.savefig('degree_distribution_for_random_network.svg', bbox_inches='tight')
minDeg: 0 | maxDeg: 13
The degree distribution of the random network is, as expected, binomial and peaks close to the mean degree of 5.045.
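As a quick sanity check (a sketch of our own, not part of the original analysis), the binomial degree distribution of an Erdős–Rényi graph is well approximated by a Poisson distribution with $\lambda = \langle k \rangle$, whose mode indeed sits at degree 5 for our parameters:

```python
import math

lam = 5.045  # mean degree of the random network, from the printout above
pmf = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(14)]
mode = max(range(14), key=lambda k: pmf[k])
print(mode)  # 5, matching the peak of the histogram
```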
We then plot the degree distributions of the Reddit network and the random network together on log-log axes to compare them. As the mean degree of the random network is nearly identical to that of the Reddit network, we mark only a single mean-degree line (the average of the two).
#Degree distribution for the reddit network:
degReddit = [x[1] for x in G_U.degree]
print('minDeg: {} | maxDeg: {}'.format(min(degReddit), max(degReddit)))
bins_reddit = np.logspace(0, np.log10(80), 14)
hist_reddit, edges_reddit = np.histogram(degReddit, bins=bins_reddit, density=True)
x_reddit = (edges_reddit[1:]+edges_reddit[:-1])/2
fig, ax = plt.subplots()
xx, yy = zip(*[(i,j) for (i,j) in zip(x_reddit, hist_reddit) if j > 0])
#The two distributions as log scale:
xx_random, yy_random = zip(*[(i,j) for (i,j) in zip(x_random, hist_random) if j > 0])
plt.plot(xx, yy, marker='.', label="Reddit network")
plt.plot(xx_random, yy_random, marker='.', color='magenta', label="Random network")
plt.title("Probability density distribution for degrees in Reddit and random network")
ax.axvline( (k+k_random)/2 , linestyle='--', color='r', label="Mean degree for both networks")
plt.ylabel("probability density")
plt.xlabel("degree")
plt.legend()
plt.yscale('log')
plt.xscale('log')
plt.xlim((0.5, 80))
#plt.savefig('probability_density_distibution_degrees_for_Reddit_and_random_network.svg', bbox_inches='tight')
minDeg: 1 | maxDeg: 50
For the random network the probability mass is concentrated in a smaller region, with the density dropping off significantly after its peak around degree 5. It is also interesting to observe that no node has a degree larger than 13.
The distribution for the real Reddit network, on the other hand, is significantly more heavy-tailed, with the largest observed degree being 50. As mentioned earlier, the probability density of the degrees does not follow a Poisson or binomial distribution but instead approximately follows a power law. This further characterizes the interactions between the users: the network's degree distribution is of the kind commonly associated with real social networks and differs significantly from that of a random network, which is intuitive.
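To put a number on the heavy tail, one could estimate the power-law exponent with the maximum-likelihood estimator of Clauset, Shalizi & Newman (2009). The sketch below applies it to a synthetic heavy-tailed sample; in the notebook one would pass `degReddit` instead (the function name and the synthetic data are our own illustration):

```python
import math
import random

def powerlaw_alpha_mle(degrees, k_min=1):
    # Discrete MLE approximation: alpha ~= 1 + n / sum(ln(k / (k_min - 0.5)))
    tail = [k for k in degrees if k >= k_min]
    return 1 + len(tail) / sum(math.log(k / (k_min - 0.5)) for k in tail)

# Synthetic heavy-tailed stand-in for the real degree list `degReddit`:
random.seed(42)
sample = [max(1, int(random.paretovariate(1.5))) for _ in range(1000)]
alpha = powerlaw_alpha_mle(sample)
print(round(alpha, 2))
```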
degReddit = [x[1] for x in G_U.degree]
print('minDeg: {} | maxDeg: {}'.format(min(degReddit), max(degReddit)))
bins_reddit = np.logspace(0, np.log10(80), 14)
hist_reddit, edges_reddit = np.histogram(degReddit, bins=bins_reddit, density=True)
x_reddit = (edges_reddit[1:]+edges_reddit[:-1])/2
fig, ax = plt.subplots()
xx, yy = zip(*[(i,j) for (i,j) in zip(x_reddit, hist_reddit) if j > 0])
ax.plot(xx, yy, marker='.', label = "Degrees")
plt.title("Probability density distribution for degrees in Reddit network")
ax.axvline(k, linestyle='--', color='r', label="Mean degree")
ax.set_ylabel("probability density")
ax.set_xlabel("degree")
ax.set_yscale('log')
ax.set_xscale('log')
plt.legend()
plt.savefig('probability_density_distibution_degrees_for_Reddit_network.svg', bbox_inches='tight')
minDeg: 1 | maxDeg: 50
Next we calculated the clustering coefficient for our network, to investigate its general interconnectedness. The clustering coefficient measures the density of links in node $i$'s immediate neighborhood on a scale from 0 to 1: $C_i = 0$ means that there are no links between $i$'s neighbors, while $C_i = 1$ implies that all of $i$'s neighbors link to each other. It therefore tells us whether an author just interacted with a number of separate users, or whether the hub is more akin to a "community" where the neighbors also reply to each other.
The clustering coefficient is first calculated using the left-hand side of formula 3.21 in Chapter 3 of the NS book, $C_i = \frac{2L_i}{k_i(k_i-1)}$, and verified against the method provided in the networkx package. It is then compared with the random-network approximation from the right-hand side of that formula, $\langle C \rangle \approx \frac{\langle k \rangle}{N}$.
C = []
for node in G_U.nodes():
neighbors = G_U.neighbors(node)
subgraph_neighbors = G_U.subgraph(neighbors).copy()
L_i = len(subgraph_neighbors.edges)
k_i = G_U.degree(node)
#We use the formula 3.21
C.append((2*L_i) / (k_i * (k_i - 1) + 1e-6)) # the small epsilon avoids division by zero for nodes with k_i = 1
lst = list(nx.clustering(G_U).values())
lst = [float(i) for i in lst]
np.mean(C), np.mean(lst), np.mean(degReddit)/len(degReddit)
(0.016888500959335504, 0.016888504285936627, 0.0037478280260261542)
We observe that the real clustering coefficient of the network is significantly higher than the random-network approximation $\langle C \rangle \approx \frac{\langle k \rangle}{N}$ from the right-hand side of formula 3.21. We repeat the calculation for the random network:
C = []
for node in Random_graph.nodes():
neighbors = Random_graph.neighbors(node)
subgraph_neighbors = Random_graph.subgraph(neighbors).copy()
L_i = len(subgraph_neighbors.edges)
C.append((2*L_i) / (Random_graph.degree(node) * (Random_graph.degree(node) - 1) + 1e-6)) # epsilon avoids division by zero for degree-1 nodes
lst = list(nx.clustering(Random_graph).values())
lst = [float(i) for i in lst]
np.mean(lst), np.mean(C), np.mean(degrees) / len(degrees)
(0.003156324441614189, 0.0031563242780291664, 0.0037522437291496015)
For the random network, the clustering coefficient is significantly lower, and the approximation is much closer to the actual clustering coefficient. This again indicates that our network has the properties of a real social network rather than those of a random network.
#Get all components
CCs_reddit = [len(c) for c in sorted(nx.connected_components(G_U), key=len, reverse=True)]
largest_cc_reddit = max(nx.connected_components(G_U), key=len)
print(f'Number of components: {len(CCs_reddit)}')
print(f'The largest connected component contains: {len(largest_cc_reddit)}')
Number of components: 4 The largest connected component contains: 1340
S_reddit = G_U.subgraph(largest_cc_reddit).copy()
print(f'The average shortest path reddit: {nx.average_shortest_path_length(S_reddit)}')
The average shortest path reddit: 4.253025759923311
This means that an author can, on average, reach any other author in the same component in around 4 to 5 links. We therefore assume that our network exhibits the small-world property.
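A quick back-of-the-envelope check (our own sketch, using the $N$ and $\langle k \rangle$ printed earlier): the small-world property predicts average distances of order $\ln N / \ln\langle k \rangle$:

```python
import math

N, k_mean = 1346, 5.045  # values reported for the Reddit network above
expected = math.log(N) / math.log(k_mean)
print(round(expected, 2))  # ~4.45, the same order as the measured 4.25
```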
We wanted to investigate whether authors with the conviction 'Trump' or 'Biden' can be separated into two distinct communities within the graph. We use modularity as a measure to evaluate the partitioning. It quantifies how much the network deviates from the expected number of edges between nodes in a community if all of its wiring had been random. Higher modularity for a given partition thus indicates better community structure, meaning that separate communities are only sparsely connected while the individual communities are densely connected.
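Concretely, for a partition into communities $c$, modularity is $M = \sum_c \left[\frac{L_c}{L} - \left(\frac{k_c}{2L}\right)^2\right]$, where $L_c$ is the number of intra-community links and $k_c$ the total degree of community $c$. A minimal sketch (on a toy graph of our own, not the Reddit data) verifying this formula against networkx:

```python
import networkx as nx
import networkx.algorithms.community as nx_comm

# Toy graph: two triangles joined by a single bridge edge.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
parts = [{0, 1, 2}, {3, 4, 5}]

L = G.number_of_edges()  # 7
manual = sum(
    G.subgraph(c).number_of_edges() / L
    - (sum(d for _, d in G.degree(c)) / (2 * L)) ** 2
    for c in parts
)
print(round(manual, 4), round(nx_comm.modularity(G, parts), 4))  # both 0.3571
```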
trump = {key for (key, value) in dict(nx.get_node_attributes(G_U, "conviction")).items() if value == 'Trump'}
biden = {key for (key, value) in dict(nx.get_node_attributes(G_U, "conviction")).items() if value == 'Biden'}
G_U_modularity = round(nx_comm.modularity(G_U, [trump, biden]),5)
G_U_copy = G_U.copy() # take an actual copy; plain assignment would only create an alias
G_U_modularity
#Partitioning them by political conviction is nearly equivalent to a random split
0.01747
Based on section 9 in the NC book, the connectivity between the nodes in the communities is therefore assumed to be close to random and explained by the degree distribution. Additionaly it is not possible to find a good cutoff with few connections in order to seperate the 'Trump' and 'Biden' nodes. Therefore it is unlikely that a 'Trump' and 'Biden' community exists in the graph. This means that users with different political conviction interact often, and our findings support our previous calculations.
We then use the Louvain algorithm to find the best partitioning of the graph. This is in order to examine the communities that exist in the network and their composition of 'Trump' and 'Biden' nodes.
G_U = G_U_copy
partition = community_louvain.best_partition(G_U, random_state = 42)
partition_dict = {item: set() for (key, item) in partition.items()}
#partition_dict # Louvain finds 24 communities
# For Louvain vs Our own
for (key, item) in partition.items():
partition_dict[item].add(key)
louvain_modularity = nx_comm.modularity(G_U, list(partition_dict.values()))
print(len(partition_dict))
louvain_modularity
#Much higher modularity than the Trump/Biden split
24
0.5374311028054048
The Louvain algorithm found 24 communities in the network with a modularity of 0.537. This suggests a much stronger community structure, i.e. it is unlikely that the edges within these communities were created at random.
#print(G_U.edges == Original_G_U_edges)
cc = []
for i in range(len(partition_dict)):
cc.append(len(partition_dict[i]))
print('Largest community size: {}'.format(max(cc)))
Largest community size: 218
We then visualise the largest community using the Netwulf package.
G_U = G_U_copy
H_U = nx.subgraph(G_U, list(partition_dict[13])) # community 13 is the largest community found above
H_U
configuration = {"zoom": 1.3, "node_charge": -100, "node_gravity": 0.32,
"link_distance": 15, "link_distance_variation": 0, "node_collision": True,
"wiggle_nodes": False, "freeze_nodes": False,
"node_fill_color": '#79aaa0', "node_stroke_color": "#555555",
"node_label_color": "#000000","node_size": 10,"node_stroke_width": 0.4,
"node_size_variation": 0.8, "label_size": 10, "display_node_labels": False,
"scale_node_size_by_strength": True,"link_color": "#7c7c7c", "link_width": 2,
"link_alpha": 0.5, "link_width_variation": 0.5, "display_singleton_nodes": True,
"min_link_weight_percentile": 0,"max_link_weight_percentile": 1}
largest_degree_labels = []
largest_degree_nodes = sorted(H_U.degree, key=lambda x: x[1], reverse=True)[:5]
for i in range(len(largest_degree_nodes)):
largest_degree_labels.append(largest_degree_nodes[i][0])
plt.style.use('default')
network, config = nw.visualize(H_U, plot_in_cell_below=False, config = configuration)
fig, ax = nw.draw_netwulf(network, figsize = 20)
for label in largest_degree_labels:
nw.add_node_label(ax,network,label, dy = 15, fontsize = 27)
plt.savefig('politics_network_largest_community.svg', bbox_inches='tight')
We observe that the community contains a large number of both red and blue nodes, and that they are relatively interconnected. This suggests that users often interact with others who do not share their political beliefs, which supports the general trend observed earlier.
Lastly we examined whether the modularities we found were statistically different from 0. We did this by creating 1000 randomized versions of the Reddit network using the double edge swap algorithm, and started by computing the modularity of the Trump/Biden split for each of them.
random_kcnets_original = []
n = 1000
for _ in tqdm(range(n)):
G_new = G_U.copy() # swap on a copy: nx.double_edge_swap rewires the graph in place
nx.double_edge_swap(G_new, nswap = G_U.number_of_edges(), max_tries = G_U.number_of_edges()*100)
random_kcnets_original.append(nx_comm.modularity(G_new, [trump, biden]))
print('Avg. Modularity: {} \nStandard deviation: {}'.format(round(np.mean(random_kcnets_original),6),
round(np.std(random_kcnets_original),6)))
100%|██████████████████████████████████████████████████████████████████████████████| 1000/1000 [01:12<00:00, 13.73it/s]
Avg. Modularity: -0.000728 Standard deviation: 0.007873
After the random rewiring, the mean modularity is centered around 0, so the rewired networks do not retain the corresponding community structure. This is expected for a randomly wired network, as described in the theory.
fig, ax = plt.subplots(figsize=(10,5), dpi=400)
ax.hist(random_kcnets_original, label="Empirical distribution", bins=10)
ax.axvline(G_U_modularity, linestyle='--', color='r', label="TRUMP/BIDEN modularity")
ax.set_title("Empirical modularity distribution of randomly configured models")
ax.set_ylabel("Count")
ax.set_xlabel("Modularity")
ax.legend()
plt.savefig('Empirical modularity distribution of randomly configured models_trump_biden.svg', bbox_inches='tight')
Looking at the plot, the modularity of the Trump/Biden split lies within the empirical modularity distribution of the randomly rewired models. This indicates that the edges between the 'Trump' and 'Biden' groups are explained by the degree distribution, and that the connectivity is close to that of a random graph. The Trump/Biden split is therefore not a good partitioning, as its modularity is not statistically significantly different from 0.
We then run the same experiment for the Louvain partitioning:
G_U = G_U_copy
random_kcnets_louvain = []
n = 1000
for _ in tqdm(range(n)):
G_new = G_U.copy() # again, swap on a copy to avoid rewiring G_U itself
nx.double_edge_swap(G_new, nswap = G_U.number_of_edges(), max_tries = G_U.number_of_edges()*100)
random_kcnets_louvain.append(nx_comm.modularity(G_new, list(partition_dict.values())))
print('Avg. Modularity: {} \nStandard deviation: {}'.format(round(np.mean(random_kcnets_louvain),6),
round(np.std(random_kcnets_louvain),6)))
100%|██████████████████████████████████████████████████████████████████████████████| 1000/1000 [01:13<00:00, 13.53it/s]
Avg. Modularity: -0.001377 Standard deviation: 0.004507
fig, ax = plt.subplots(figsize=(10,5), dpi=400)
ax.hist(random_kcnets_louvain, label="Empirical distribution", bins=10)
ax.axvline(louvain_modularity, linestyle='--', color='r', label="Louvain communities modularity")
ax.set_title("Empirical modularity distribution of randomly configured models")
ax.set_ylabel("Count")
ax.set_xlabel("Modularity")
ax.legend()
plt.savefig('Empirical modularity distribution of randomly configured models_louvain.svg', bbox_inches='tight')
As seen in the plot, none of the 1000 randomly rewired networks had a modularity higher than the one observed for the Louvain partition. This suggests there are fewer edges between the Louvain communities than one would expect in a randomly connected graph. The partition found by the Louvain algorithm is therefore a good one, and we can conclude that its modularity is statistically significantly different from 0.
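The strength of this result can be made explicit as a z-score against the null distribution, using the mean and standard deviation printed above (a quick sketch; the exact figures depend on the random seed):

```python
# Observed Louvain modularity vs. the null distribution from the
# 1000 double-edge-swap rewirings reported above.
q_obs = 0.5374
null_mean, null_std = -0.001377, 0.004507

z = (q_obs - null_mean) / null_std
print(round(z, 1))  # well over 100 standard deviations above the null mean
```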
We're quite happy with our choice of topic for the project. Also, it was great to work with Github Pages, which was a first for all members of the group.
It would undoubtedly have been interesting to investigate the political content on r/politics over a longer time period, e.g. the six months preceding election day, which would allow for the detection of longer-term trends in redditor activity and sentiment. For such a scope to be feasible, a larger commitment of computational power would have been required than the project had available. Along the same lines, incorporating other US politics subreddits into the analysis could give a broader description, as well as more depth on the edge cases of extremism on either side of the political spectrum.
Since the submissions on r/politics are links to articles from many news sources, magazines, etc., one could also delve into the specific sentiments used in the articles and how they relate to the sentiments of the comments, i.e. is a Fox News article surrounded primarily by Trump supporters or by his critics? This would expand the field of research considerably.
Furthermore, improving the text analysis with regard to the contents of the comments in a more informed way, e.g. by accounting for bias, sarcasm, et cetera, would be advisable.
Another possibility would be to build a model using bi-grams and Naïve Bayes classification in order to predict polling data with machine learning.